Improvement in TF-IDF scheme for Web Pages and its Retrieval Accuracy
Authors
Abstract
In information retrieval (IR) systems based on the vector space model, the tf-idf scheme is widely used to characterize documents. However, for documents with hyperlink structures, such as Web pages, a technique is needed that represents the contents of a page more accurately by exploiting the contents of its hyperlinked neighboring pages. In this paper, we first propose several methods for improving the tf-idf scheme for a target Web page by using the contents of its hyperlinked neighboring pages, and then compare the retrieval accuracy of the proposed methods.
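The abstract does not spell out the proposed methods, so the following is only a minimal sketch of the general idea: compute plain tf-idf vectors, then add a down-weighted average of the neighboring pages' vectors to the target page's vector. The function names, the weight `alpha`, the averaging rule, and the toy pages are all assumptions made for illustration, not the authors' formulation.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document (raw tf * log(N / df))."""
    n = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(doc).items()}
        for doc in docs
    ]


def blend_with_neighbors(vectors, links, target, alpha=0.5):
    """Add the alpha-scaled mean of the neighbors' vectors to the target's vector.

    alpha and the averaging rule are illustrative assumptions.
    """
    neighbors = links.get(target, [])
    refined = dict(vectors[target])
    for nb in neighbors:
        for term, weight in vectors[nb].items():
            refined[term] = refined.get(term, 0.0) + alpha * weight / len(neighbors)
    return refined


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy collection: page 0 links to page 1.
pages = [
    ["jaguar", "speed"],
    ["jaguar", "cat", "habitat"],
    ["car", "engine", "speed"],
]
links = {0: [1]}

vectors = tf_idf_vectors(pages)
refined = blend_with_neighbors(vectors, links, target=0)
query = {"cat": 1.0}

print("plain tf-idf score:  ", cosine(query, vectors[0]))
print("neighbor-aware score:", cosine(query, refined))
```

In this toy setup, page 0 never mentions "cat", so its plain tf-idf vector cannot match the query; after blending in its linked neighbor, the refined vector carries a nonzero weight for that term. How much weight neighbors should receive, and which neighbors to include, are the kinds of choices a retrieval-accuracy comparison would need to settle.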
Similar resources
On Some Methods for Improving Feature Vectors for Web Pages and their Retrieval Accuracy
In IR (information retrieval) systems based on the vector space model, the tf-idf scheme is widely used to characterize documents. However, in the case of documents with hyperlink structures such as Web pages, it is necessary to develop a technique for representing the contents of Web pages more accurately by exploiting that of their hyperlinked neighboring pages. In this paper, we first propos...
A Method of Improving Feature Vector for Web Pages Reflecting the Contents of Their Out-Linked Pages
TF-IDF schemes are popular for generating the feature vectors of documents. These schemes are proposed for characterizing one document. Therefore, in order to characterize Web pages using tf-idf schemes, the feature vectors of the Web pages should be reflected by the contents of Web pages linked with other pages via hyperlinks. In this paper, we propose three methods of generating feature vecto...
REINA at WebCLEF2006. Mixing Fields to Improve Retrieval
This paper describes the participation of the REINA Research Group of the University of Salamanca at WebCLEF 2006. The task in which we participated this year is the Monolingual Mixed Task in Spanish. To select web pages of the EuroGov collection in Spanish, the wide collection was processed with a language guesser, searching for pages in Spanish. All pages in the .es domain were also pre-s...
Toward improvement of SDR accuracy using LDA and query expansion for SpokenDoc
This paper investigates several techniques for spoken document retrieval, toward improvement of retrieval performance based on the conventional method, i.e., TF-IDF. The first approach employs rescaled unigrams of LDA to compute a similarity score. The second technique employs query expansion by web retrieval using the Yahoo! API. And the third technique is Prioritized And-operator Retrieval based on ...
Utilizing the Subjective Intent of Authoring Formats to Perform Focused Web Crawling
A successful web information retrieval system requires the ability to determine quickly and accurately whether a document or a link should be further explored. Current state-of-the-art web search engines typically use the meta-information in the HTML header to determine the relevancy of the documents. However, many documents on the web do not have such HTML header information. On the other hand...
Publication date: 2003